This take home exercise aims to explain factors affecting resale prices of public housing in Singapore by building hedonic pricing models using appropriate GWR methods.
Housing price is affected by a variety of factors, including global factors such as the general economy of a country or property specific factors such as structural variables related to the property and locational variables related to the neighbourhood.
Hedonic pricing model is used to examine the effect of housing factors as on property price. However, this method fails to take into consideration that spatial autocorrelation and spatial heterogeneity exist in housing transaction data, which could lead to biased results (Anselin 1998). In view of this limitation, Geographical Weighted Regression (GWR) was introduced to calibrate hedonic price models for housing.
As such, we will be building hedonic pricing models to explain factors affecting the resale prices of public housing in Singapore using appropriate GWR methods.
The data sets used for this analysis include:
The packages used for this analysis include:
packages = c('olsrr', 'corrplot', 'ggpubr', 'sf', 'spdep', 'GWmodel', 'tmap', 'tidyverse', 'httr', 'units', 'matrixStats')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
st_read() of sf package is used to import the geospatial data, which is in shapefile format.
mrt_sf <- st_read(dsn="data/geospatial",
layer="MRTLRTStnPtt")
Reading layer `MRTLRTStnPtt' from data source
`C:\Users\user\Desktop\SMU\Y4S1\IS415\geniceee\IS415_blog\_posts\2021-11-05-take-home-exercise-3\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 185 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 6138.311 ymin: 27555.06 xmax: 45254.86 ymax: 47854.2
Projected CRS: SVY21
bus_sf <- st_read(dsn="data/geospatial",
layer="BusStop")
Reading layer `BusStop' from data source
`C:\Users\user\Desktop\SMU\Y4S1\IS415\geniceee\IS415_blog\_posts\2021-11-05-take-home-exercise-3\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 5156 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 4427.938 ymin: 26482.1 xmax: 48282.5 ymax: 52983.82
Projected CRS: SVY21
mpsz_sf <- st_read(dsn="data/geospatial",
layer="MP14_SUBZONE_WEB_PL")
Reading layer `MP14_SUBZONE_WEB_PL' from data source
`C:\Users\user\Desktop\SMU\Y4S1\IS415\geniceee\IS415_blog\_posts\2021-11-05-take-home-exercise-3\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21
From the output message, we can see that:
mrt_sf sf data frame.bus_sf sf data frame.mpsz_sf sf data frame.Before we visualise the geospatial data, we will need to conduct data preprocessing to ensure that there are no invalid geometries and missing values.
[1] 0
[1] 0
[1] 9
There are no invalid geometries in both mrt_sfand bus_sf data frames while the mpsz_sf data frame contains 9 invalid geometries. We will now proceed to remove the invalid geometries in the mpsz_sf data frame.
mpsz_sf <- st_make_valid(mpsz_sf)
# Check again for invalid geometries
length(which(st_is_valid(mpsz_sf) == FALSE))
[1] 0
Simple feature collection with 0 features and 3 fields
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] OBJECTID STN_NAME STN_NO geometry
<0 rows> (or 0-length row.names)
Simple feature collection with 1 feature and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 22616.75 ymin: 47793.68 xmax: 22616.75 ymax: 47793.68
Projected CRS: SVY21
BUS_STOP_N BUS_ROOF_N LOC_DESC geometry
264 47201 UNK <NA> POINT (22616.75 47793.68)
Simple feature collection with 0 features and 15 fields
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] OBJECTID SUBZONE_NO SUBZONE_N SUBZONE_C CA_IND PLN_AREA_N
[7] PLN_AREA_C REGION_N REGION_C INC_CRC FMEL_UPD_D X_ADDR
[13] Y_ADDR SHAPE_Leng SHAPE_Area geometry
<0 rows> (or 0-length row.names)
We can see that there is 1 observation with missing values in the bus_sf data frame. We will remove it because another bus stop of the same BUS_ROOF_N is found after further data exploration.
bus_sf <- bus_sf[!rowSums(is.na(bus_sf))!=0,]
# Check again for missing values
bus_sf[rowSums(is.na(bus_sf))!=0,]
Simple feature collection with 0 features and 3 fields
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] BUS_STOP_N BUS_ROOF_N LOC_DESC geometry
<0 rows> (or 0-length row.names)
mrt_sf[duplicated(mrt_sf$STN_NAME),]
Simple feature collection with 19 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 22949.03 ymin: 28735.78 xmax: 42362.71 ymax: 46418.56
Projected CRS: SVY21
First 10 features:
OBJECTID STN_NAME STN_NO geometry
33 33 ANG MO KIO MRT STATION NS16 POINT (29807.27 39105.77)
97 105 MACPHERSON MRT STATION DT26 POINT (34235.8 34256.43)
103 111 BUGIS MRT STATION EW12 POINT (30491.56 31424.35)
114 122 EXPO MRT STATION DT35 POINT (42362.71 35285.67)
116 124 BUONA VISTA MRT STATION CC22 POINT (23251.76 32090.77)
121 129 MARINA BAY MRT STATION CE2 POINT (30423.43 28735.78)
124 132 CHINATOWN MRT STATION DT19 POINT (29238.97 29686.54)
134 142 TAMPINES MRT STATION DT32 POINT (40213.03 37463.37)
140 148 SERANGOON MRT STATION NE12 POINT (32480.07 36869.39)
144 152 PAYA LEBAR MRT STATION CC9 POINT (34550.58 33300.34)
bus_sf[duplicated(bus_sf$BUS_STOP_N),]
Simple feature collection with 13 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 13488.02 ymin: 29969.22 xmax: 44144.57 ymax: 47806.75
Projected CRS: SVY21
First 10 features:
BUS_STOP_N BUS_ROOF_N LOC_DESC
350 58031 UNK OPP CANBERRA DR
1569 51071 B21 MACRITCHIE RESERVOIR
2208 82221 B01 Blk 3A
2215 97079 B14 OPP ST. JOHN'S CRES
2392 62251 B03 BEF BLK 471B
2462 22501 B02 BLK 662A
2736 16119 NIL TELETECH PARK
2976 58229 B01A BLK 358A CP
3442 67421 NIL CHENG LIM STN EXIT B
3521 68091 B08 AFT BAKER ST
geometry
350 POINT (27089.69 47570.9)
1569 POINT (28305.37 36036.67)
2208 POINT (35308.74 33335.17)
2215 POINT (44144.57 38980.25)
2392 POINT (35500.36 39943.34)
2462 POINT (13488.02 35537.88)
2736 POINT (22318.89 29969.22)
2976 POINT (26094.86 47806.75)
3442 POINT (34741.77 42004.21)
3521 POINT (32038.84 43298.68)
mpsz_sf[duplicated(mpsz_sf$SUBZONE_C),]
Simple feature collection with 0 features and 15 fields
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] OBJECTID SUBZONE_NO SUBZONE_N SUBZONE_C CA_IND PLN_AREA_N
[7] PLN_AREA_C REGION_N REGION_C INC_CRC FMEL_UPD_D X_ADDR
[13] Y_ADDR SHAPE_Leng SHAPE_Area geometry
<0 rows> (or 0-length row.names)
We can observe that there are 19 and 13 duplicated observations in the mrt_sf and bus_sf data frames respectively, while there are none in the mpsz_sf data frame. We will proceed to remove the duplicated observations.
mrt_sf <- mrt_sf[!duplicated(mrt_sf$STN_NAME),]
bus_sf <- bus_sf[!duplicated(bus_sf$BUS_STOP_N),]
# Check again for duplicates
mrt_sf[duplicated(mrt_sf$STN_NAME),]
Simple feature collection with 0 features and 3 fields
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] OBJECTID STN_NAME STN_NO geometry
<0 rows> (or 0-length row.names)
bus_sf[duplicated(bus_sf$BUS_STOP_N),]
Simple feature collection with 0 features and 3 fields
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] BUS_STOP_N BUS_ROOF_N LOC_DESC geometry
<0 rows> (or 0-length row.names)
We will first need to retrieve the coordinate reference system for verification. st_crs() of sf package is used to do this.
st_crs(mrt_sf)
Coordinate Reference System:
User input: SVY21
wkt:
PROJCRS["SVY21",
BASEGEOGCRS["SVY21[WGS84]",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ID["EPSG",6326]],
PRIMEM["Greenwich",0,
ANGLEUNIT["Degree",0.0174532925199433]]],
CONVERSION["unnamed",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["Degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["Degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1,
ID["EPSG",9001]]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1,
ID["EPSG",9001]]]]
st_crs(bus_sf)
Coordinate Reference System:
User input: SVY21
wkt:
PROJCRS["SVY21",
BASEGEOGCRS["WGS 84",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ID["EPSG",6326]],
PRIMEM["Greenwich",0,
ANGLEUNIT["Degree",0.0174532925199433]]],
CONVERSION["unnamed",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["Degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["Degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1,
ID["EPSG",9001]]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1,
ID["EPSG",9001]]]]
st_crs(mpsz_sf)
Coordinate Reference System:
User input: SVY21
wkt:
PROJCRS["SVY21",
BASEGEOGCRS["SVY21[WGS84]",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ID["EPSG",6326]],
PRIMEM["Greenwich",0,
ANGLEUNIT["Degree",0.0174532925199433]]],
CONVERSION["unnamed",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["Degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["Degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1,
ID["EPSG",9001]]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1,
ID["EPSG",9001]]]]
From the output messages, we can observe that the EPSG code for the data frames is currently 9001. This is wrong because the EPSG code of projection coordinate system SVY21 is supposed to be 3414, instead of 9001.
st_set_crs() of sf package is therefore used to assign the correct EPSG code for the data frames.
mrt_sf <- st_set_crs(mrt_sf, 3414)
bus_sf <- st_set_crs(bus_sf, 3414)
mpsz_sf <- st_set_crs(mpsz_sf, 3414)
We will now use st_crs() of sf package to retrieve the coordinate reference system again.
st_crs(mrt_sf)
Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
st_crs(bus_sf)
Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
st_crs(mpsz_sf)
Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
The EPSG code is now correctly assigned for all sf data frames!!
It is useful to plot a map to visualise the geospatial data, so that we can easily get a preliminary look at the spatial patterns.
tmap_mode('view')
tm_shape(mrt_sf) +
tm_dots(col="red") +
tm_shape(bus_sf) +
tm_dots(col="blue")
If we look closely, we can see that there are 5 bus stops outside of Singapore’s boundary (46211, 46219, 46239, 46609, 47701). As we are able to travel to and fro Johor Bahru with specific buses, there are designated bus stops located at Johor Bahru.
As such, we should remove these bus stops before proceeding with our analysis.
In this section, we will proceed to remove the bus stops identified earlier.
omit_bus <- list(46211, 46219, 46239, 46609, 47701)
bus_sf[bus_sf$BUS_STOP_N %in% omit_bus,]
Simple feature collection with 5 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 17969.14 ymin: 49173.42 xmax: 20723.24 ymax: 52983.82
Projected CRS: SVY21 / Singapore TM
BUS_STOP_N BUS_ROOF_N LOC_DESC
765 47701 NIL JB SENTRAL
1532 46239 NA LARKIN TER
2257 46609 NA KOTARAYA II TER
2269 46211 NIL JOHOR BAHRU CHECKPT
4346 46219 NIL JOHOR BAHRU CHECKPT
geometry
765 POINT (20370.54 49173.42)
1532 POINT (17969.14 52983.82)
2257 POINT (20025.46 49261.6)
2269 POINT (20623.18 49199.99)
4346 POINT (20723.24 49200.05)
We can observe that the location descriptions of these bus stops indeed indicate that they are situated in Johor Bahru. We will therefore need to remove them.
bus_sf <- bus_sf[!bus_sf$BUS_STOP_N %in% omit_bus,]
# Check again if we have removed successfully
bus_sf[bus_sf$BUS_STOP_N %in% omit_bus,]
Simple feature collection with 0 features and 3 fields
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
[1] BUS_STOP_N BUS_ROOF_N LOC_DESC geometry
<0 rows> (or 0-length row.names)
We will be making use of the package onemapsgapi to query the required data sets from OneMap API. We will then save these data sets in CSV format.
library(onemapsgapi)
token <- "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOjc5NTUsInVzZXJfaWQiOjc5NTUsImVtYWlsIjoiZ2VuaWNlLmdvaC4yMDE4QHNjaXMuc211LmVkdS5zZyIsImZvcmV2ZXIiOmZhbHNlLCJpc3MiOiJodHRwOlwvXC9vbTIuZGZlLm9uZW1hcC5zZ1wvYXBpXC92MlwvdXNlclwvc2Vzc2lvbiIsImlhdCI6MTYzNTU3OTEzMywiZXhwIjoxNjM2MDExMTMzLCJuYmYiOjE2MzU1NzkxMzMsImp0aSI6ImIzZmEyOTg1MDQyZTFiOGY0MWYzMWJlNzFmYzdlZTEwIn0.z2gh5tK3ezz0lUXLoPilPVOfN0MaXJm4N4RbAtVIQFI"
data <- list("eldercare", "hawkercentre", "relaxsg", "supermarkets", "kindergartens", "childcare")
for (d in data) {
df <- get_theme(token, d)
write_csv(df, str_replace("data/aspatial/df.csv", "df", d))
}
read_csv() of readr package is used to import the CSV files. The outputs are tibble data frames.
eldercare = read_csv("data/aspatial/eldercare.csv")
hawker = read_csv("data/aspatial/hawkercentre.csv")
park = read_csv("data/aspatial/relaxsg.csv")
supermarket = read_csv("data/aspatial/supermarkets.csv")
kindergarten = read_csv("data/aspatial/kindergartens.csv")
childcare = read_csv("data/aspatial/childcare.csv")
school = read_csv("data/aspatial/schools.csv")
mall = read_csv("data/aspatial/shopping-mall.csv")
flat_resale = read_csv("data/aspatial/resale-flat-prices.csv")
It is also important to understand the data that we are working with. glimpse() of dplyr package is therefore used to perform exploratory data analysis.
glimpse(eldercare)
glimpse(hawker)
glimpse(park)
glimpse(supermarket)
glimpse(kindergarten)
glimpse(childcare)
glimpse(school)
glimpse(mall)
glimpse(flat_resale)
We will proceed to prepare the data such that they can be used later for the preparation of independent variables. The steps taken include:
# remove redundant columns for eldercare dataset
eldercare <- eldercare %>%
select(c(1, 4:5))
# remove redundant columns for hawker dataset
hawker <- hawker %>%
select(c(1, 14:15))
# remove redundant columns for park dataset
park <- park %>%
select(c(1, 8:9))
# remove redundant columns for supermarket dataset
supermarket <- supermarket %>%
select(c(1, 7:8))
# remove redundant columns for kindergarten dataset
kindergarten <- kindergarten %>%
select(c(1, 5:6))
# remove redundant columns for childcare dataset
childcare <- childcare %>%
select(c(1, 5:6))
# filter out primary schools; remove redundant columns
prischool <- school %>%
filter(mainlevel_code == "PRIMARY") %>%
select(c(1, 3:4, 25,27))
# filter out 4-room flats with transaction period from 01-Jan-19 to 30-Sep-20
flat_resale <- flat_resale %>%
filter(flat_type == "4 ROOM") %>%
filter(month >= "2019-01" & month < "2020-10")
We will need to conduct data preprocessing to ensure that there are no NA values in all data frames.
eldercare[rowSums(is.na(eldercare))!=0,]
hawker[rowSums(is.na(hawker))!=0,]
park[rowSums(is.na(park))!=0,]
supermarket[rowSums(is.na(supermarket))!=0,]
kindergarten[rowSums(is.na(kindergarten))!=0,]
childcare[rowSums(is.na(childcare))!=0,]
prischool[rowSums(is.na(prischool))!=0,]
mall[rowSums(is.na(mall))!=0,]
flat_resale[rowSums(is.na(flat_resale))!=0,]
There are no NA values in all the data frames.
After exploratory data analysis performed earlier, it is found that prischool and flat_resale data frames do not have Lat and Lng columns. These columns are required for the preparation of the independent variables later. We therefore need to use OneMap API to geocode the coordinates columns.
We first need to create an address column for the flat_resale data frame since the address data is currently split into 2 columns block and street name respectively.
unite() of dpylr package is used to concatenate the 2 columns.
flat_resale <- flat_resale %>%
unite("address", block:street_name, sep= " ")
We alo observe that some addresses in the the flat_resale data frame start with “ST.”. It is found later that this will pose problems during geocoding. As a result, we will need to rename “ST.” to its full version “SAINT”.
flat_resale$address <- gsub("ST\\.", "SAINT", flat_resale$address)
In this function, the input variable address is passed in as a search value in the query to the API. The output is then converted to a data frame, where we only choose to retain the Latitude and Longitude columns. These 2 columns are returned as the output.
geocode <- function(address) {
url <- "https://developers.onemap.sg/commonapi/search"
query <- list("searchVal" = address, "returnGeom" = "Y",
"getAddrDetails" = "N", "pageNum" = "1")
res <- GET(url, query = query, verbose())
output <- content(res) %>%
as.data.frame %>%
select(results.LATITUDE, results.LONGITUDE)
return(output)
}
We will now loop through each row of the prischool data frame and implement the geocode function. The output is saved as 2 new columns Lat and Lng.
prischool$Lat <- 0
prischool$Lng <- 0
for (i in 1:nrow(prischool)) {
output <- geocode(prischool$postal_code[i])
prischool$Lat[i] <- output$results.LATITUDE
prischool$Lng[i] <- output$results.LONGITUDE
}
We will now loop through each row of the flat_resale data frame and implement the geocode function. The output is saved as 2 new columns Lat and Lng.
flat_resale$Lat <- 0
flat_resale$Lng <- 0
for (i in 1:nrow(flat_resale)) {
output <- geocode(flat_resale$address[i])
flat_resale$Lat[i] <- output$results.LATITUDE
flat_resale$Lng[i] <- output$results.LONGITUDE
}
We need to conduct dummy coding on the storey_range variable for us to use it in model later. We will first look at the individual storey-range values for us to get a rough idea on the number of columns that will be produced.
unique(flat_resale$storey_range)
We observed that there are 17 storey range categories.
To conduct dummy coding, we will be using pivot_wider() of dplyr package to create duplicate variables representing every store-range. If the obeservation belongs to the storey-range, the value will be 1. The value will be 0 otherwise.
flat_resale <- flat_resale %>%
pivot_wider(names_from = storey_range, values_from = storey_range,
values_fn = list(storey_range = ~1), values_fill = 0)
We need to convert the remaining_lease column from string to numeric format to use in the model later. We will first split the string within the column and take the value(s) of the year and month. We will then calculate the remaining lease in years and replace the original value.
str_list <- str_split(flat_resale$remaining_lease, " ")
for (i in 1:length(str_list)) {
if (length(unlist(str_list[i])) > 2) {
year <- as.numeric(unlist(str_list[i])[1])
month <- as.numeric(unlist(str_list[i])[3])
flat_resale$remaining_lease[i] <- year + round(month/12, 2)
}
else {
year <- as.numeric(unlist(str_list[i])[1])
flat_resale$remaining_lease[i] <- year
}
}
We need to filter out good primary schools and save it as variable goodprischool for it to be used in the model. Good primary schools are defined to be Special Assistance Programme (SAP) primary schools or primary schools with the Gifted Education Programme (GEP).
gdprischool <- prischool %>%
filter(sap_ind == "Yes" | gifted_ind == "Yes")
The Central Business District (CBD) is in Downtown Core, located in the southwest part of Singapore. We will therefore take the coordinates of Downtown Core to be the coordinates of the CBD and convert it into a sf data frame.
lat <- 1.287953
long <- 103.851784
cbd_sf <- data.frame(lat, long) %>%
st_as_sf(coords = c("long", "lat"), crs=4326) %>%
st_transform(crs=3414)
We need to convert the data sets into sf data frames for us to calculate the proximity distance matrices.
flat_resale_sf <- st_as_sf(flat_resale,
coords = c("Lng", "Lat"), crs=4326) %>%
st_transform(crs = 3414)
eldercare_sf <- st_as_sf(eldercare,
coords = c("Lng", "Lat"), crs=4326) %>%
st_transform(crs = 3414)
hawker_sf <- st_as_sf(hawker,
coords = c("Lng", "Lat"), crs=4326) %>%
st_transform(crs = 3414)
park_sf <- st_as_sf(park,
coords = c("Lng", "Lat"), crs=4326) %>%
st_transform(crs = 3414)
gdprischool_sf <- st_as_sf(gdprischool,
coords = c("Lng", "Lat"), crs=4326) %>%
st_transform(crs = 3414)
mall_sf <- st_as_sf(mall,
coords = c("longitude", "latitude"), crs=4326) %>%
st_transform(crs = 3414)
supermarket_sf <- st_as_sf(supermarket,
coords = c("Lng", "Lat"), crs=4326) %>%
st_transform(crs = 3414)
kindergarten_sf <- st_as_sf(kindergarten,
coords = c("Lng", "Lat"), crs=4326) %>%
st_transform(crs = 3414)
childcare_sf <- st_as_sf(childcare,
coords = c("Lng", "Lat"), crs=4326) %>%
st_transform(crs = 3414)
prischool_sf <- st_as_sf(prischool,
coords = c("Lng", "Lat"), crs=4326) %>%
st_transform(crs = 3414)
This function computes the distance to the nearest facility. st_distance() of sf package is used to compute the distance between all flats and the respective facilities. rowMins() of matrixStats is then used to take the shortest distance within each row in the output matrix. The values will be appended to the data frame as a new column.
proximity <- function(df1, df2, varname) {
matrix <- st_distance(df1, df2) %>%
drop_units()
df1[,varname] <- rowMins(matrix)
return(df1)
}
We will now implement the proximity function to the required variables.
flat_resale_sf <- proximity(flat_resale_sf, cbd_sf, "PROX_CBD") %>%
proximity(., eldercare_sf, "PROX_EC") %>%
proximity(., hawker_sf, "PROX_HA") %>%
proximity(., mrt_sf, "PROX_MRT") %>%
proximity(., park_sf, "PROX_PARK") %>%
proximity(., gdprischool_sf, "PROX_GDPRI") %>%
proximity(., mall_sf, "PROX_MALL") %>%
proximity(., supermarket_sf, "PROX_SUP")
This function computes the facility count within a radius. Similarly, st_distance() of sf package is used to compute the distance between all flats and the respective facilities. rowSums() of dplyr is then used to count the observations with distances below the defined radius. The values will be appended to the data frame as a new column.
num_radius <- function(df1, df2, varname, radius) {
matrix <- st_distance(df1, df2) %>%
drop_units() %>%
as.data.frame()
df1[,varname] <- rowSums(matrix <= radius)
return(df1)
}
We will now implement the num_radius function to the required variables.
flat_resale_sf <- num_radius(flat_resale_sf, kindergarten_sf,
"NUM_KIN", 350) %>%
num_radius(., childcare_sf, "NUM_CC", 350) %>%
num_radius(., bus_sf, "NUM_STOP", 350) %>%
num_radius(., prischool_sf, "NUM_PRI", 1000)
Before saving the dataset, we will remove the redundant columns, rename the columns to shorter forms and relocate the price columns to the front of the data frame.
flat_resale_sf <- flat_resale_sf %>%
select(5, 8, 9, 10:39) %>%
rename("AREA_SQM" = "floor_area_sqm", "LEASE_YRS" = "remaining_lease",
"PRICE"= "resale_price") %>%
relocate(`PRICE`)
We can now save the final flat resale price data set as a SHP file using st_write of sf package.
st_write(flat_resale_sf, "data/geospatial/resale-flat-prices-final.shp")
st_read() of sf package is used to import the final data set, which is in shapefile format.
flat_resale_sf <- st_read(dsn="data/geospatial",
layer="resale-flat-prices-final")
Reading layer `resale-flat-prices-final' from data source
`C:\Users\user\Desktop\SMU\Y4S1\IS415\geniceee\IS415_blog\_posts\2021-11-05-take-home-exercise-3\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 15853 features and 32 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 11597.31 ymin: 28217.39 xmax: 42623.63 ymax: 48741.06
Projected CRS: SVY21 / Singapore TM
We can plot the distribution of PRICE using a histogram as shown below.
ggplot(data=flat_resale_sf, aes(x=`PRICE`)) +
geom_histogram(bins=20, color="black", fill="light blue")

The figure above reveals a slightly right skewed distribution. This means that more HDB flats were transacted at relative lower prices.
Statistically, the skewed distribution can be normalised using log transformation. mutate() of dplyr package is used to derive a new variable called LOG_PRICE using a log transformation on the variable PRICE.
flat_resale_sf <- flat_resale_sf %>%
mutate(`LOG_PRICE` = log(PRICE)) %>%
relocate(`LOG_PRICE`, .after = `PRICE`)
We can now plot the distribution of LOG_PRICE.
ggplot(data=flat_resale_sf, aes(x=`LOG_PRICE`)) +
geom_histogram(bins=20, color="black", fill="light blue")

It is found that the LEASE_YRS column is still in string format. We will convert it to numeric format for us to conduct EDA and to input into the model later.
flat_resale_sf$LEASE_YRS <- as.numeric(flat_resale_sf$LEASE_YRS)
We will now draw multiple histograms to show multiple distributions of the independent variables. This is done using ggarrange() of ggpubr package.
AREA_SQM <- ggplot(data=flat_resale_sf, aes(x= `AREA_SQM`)) +
geom_histogram(bins=20, color="black", fill="light blue")
LEASE_YRS <- ggplot(data=flat_resale_sf, aes(x= `LEASE_YRS`)) +
geom_histogram(bins=20, color="black", fill="light blue")
PROX_CBD <- ggplot(data=flat_resale_sf, aes(x= `PROX_CBD`)) +
geom_histogram(bins=20, color="black", fill="light blue")
PROX_EC <- ggplot(data=flat_resale_sf, aes(x= `PROX_EC`)) +
geom_histogram(bins=20, color="black", fill="light blue")
PROX_HAWKER <- ggplot(data=flat_resale_sf, aes(x= `PROX_HA`)) +
geom_histogram(bins=20, color="black", fill="light blue")
PROX_MRT <- ggplot(data=flat_resale_sf, aes(x= `PROX_MRT`)) +
geom_histogram(bins=20, color="black", fill="light blue")
PROX_PARK <- ggplot(data=flat_resale_sf, aes(x= `PROX_PARK`)) +
geom_histogram(bins=20, color="black", fill="light blue")
PROX_GDPRI <- ggplot(data=flat_resale_sf, aes(x= `PROX_GDPRI`)) +
geom_histogram(bins=20, color="black", fill="light blue")
PROX_MALL <- ggplot(data=flat_resale_sf, aes(x= `PROX_MALL`)) +
geom_histogram(bins=20, color="black", fill="light blue")
PROX_SUP <- ggplot(data=flat_resale_sf, aes(x= `PROX_SUP`)) +
geom_histogram(bins=20, color="black", fill="light blue")
NUM_KIN <- ggplot(data=flat_resale_sf, aes(x= `NUM_KIN`)) +
geom_histogram(bins=20, color="black", fill="light blue")
NUM_CC <- ggplot(data=flat_resale_sf, aes(x= `NUM_CC`)) +
geom_histogram(bins=20, color="black", fill="light blue")
NUM_STOP <- ggplot(data=flat_resale_sf, aes(x= `NUM_STOP`)) +
geom_histogram(bins=20, color="black", fill="light blue")
NUM_PRI <- ggplot(data=flat_resale_sf, aes(x= `NUM_PRI`)) +
geom_histogram(bins=20, color="black", fill="light blue")
ggarrange(AREA_SQM, LEASE_YRS, PROX_CBD, PROX_EC, PROX_HAWKER, PROX_MRT, PROX_PARK, PROX_GDPRI, PROX_MALL, PROX_SUP, NUM_KIN, NUM_CC, NUM_STOP, NUM_PRI, ncol = 3, nrow = 5)

We can observe that the distribution of the majority of the independent variables are right skewed, while variables such as LEASE_YRS and PROX_CBD are slightly left skewed.
We want to reveal the geospatial distribution flat resale prices (i.e. dependent variable) in Singapore using the tmap package.
tmap_mode("view")
tm_shape(mpsz_sf)+
tm_polygons() +
tm_shape(flat_resale_sf) +
tm_dots(col = "PRICE",
alpha = 0.6,
style="quantile") +
tm_view(set.zoom.limits = c(11,14))
From the map, we can observe that flats with higher resale prices are more concentrated in the central area of Singapore.